Conversation

yyDing1 (Collaborator) commented on Oct 4, 2025

The current reward model implementation faces the following challenges:

  1. Model Support: It is primarily designed for discriminative models and lacks robust support for generative reward models.
  2. Complexity: It relies on heavyweight backends like FSDP or Megatron, which are often unnecessary for typical reward model inference tasks.
  3. Flexibility: The batch-level synchronization mechanism makes it hard for developers to implement more flexible, sample-level reward functions.

What this PR does

To address these issues, this PR introduces a more flexible and easy-to-use reward model design. Specifically, it implements two main classes: RewardModelManager and RewardManagerWorker, with some runnable scripts in recipe/fapo.

image
  • RewardModelManager launches multiple reward servers and manages them with a router-based approach (using SGLang Router), distributing incoming requests across the servers.
  • RewardManagerWorker retrieves the remote actor handle, providing users with greater flexibility in designing custom reward functions. For example, users can easily implement a customized reward function like the following:
async def compute_score(
    data_source: str,
    solution_str: str,
    ground_truth: str,
    extra_info: dict,
    reward_router_address: str,
    reward_model_tokenizer: PreTrainedTokenizer,
):
    # Compute rule-based reward score
    rule_based_score = ...

    # Compute GRM reward score
    grm_prompts = ...
    grm_prompt_ids = ...
    # Users can directly call the reward model
    grm_outputs = post(f"http://{reward_router_address}/generate", ...)  # POST request to the reward router
    ...

    # Final reward score
    final_score = ...

    return final_score

This implementation exposes a reward-model interface inside the compute_score function, maximizing flexibility and convenience for algorithm design.

Note that compute_score is an asynchronous function, so efficiency is not a concern: each sample is processed concurrently.
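As a concrete illustration, here is a minimal sketch of how the GRM call in the example above could be filled in with an async HTTP client. It assumes the router forwards SGLang-style /generate requests (a JSON body with text and sampling_params, with the generation returned under the text key); the judge prompt and the aiohttp-based helper below are illustrative and not part of this PR.

```python
import aiohttp
from transformers import PreTrainedTokenizer


async def query_grm(
    reward_router_address: str,
    reward_model_tokenizer: PreTrainedTokenizer,
    solution_str: str,
    ground_truth: str,
) -> str:
    # Build a judge prompt with the reward model's chat template.
    messages = [
        {
            "role": "user",
            "content": (
                "Judge whether the answer matches the reference.\n"
                f"Answer: {solution_str}\n"
                f"Reference: {ground_truth}"
            ),
        }
    ]
    grm_prompt = reward_model_tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # POST to the reward router, which forwards to one of the reward servers.
    payload = {
        "text": grm_prompt,
        "sampling_params": {"temperature": 0.0, "max_new_tokens": 512},
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(
            f"http://{reward_router_address}/generate", json=payload
        ) as resp:
            resp.raise_for_status()
            output = await resp.json()

    # The generated judgment text; parse it into a numeric score in compute_score.
    return output["text"]
```

The returned text can then be parsed into a numeric GRM score and combined with the rule-based score to produce final_score.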

Integration with AgentLoop

This PR introduces asynchronous reward computation for individual samples (async def run_single(self, data: DataProto) -> dict) and leverages an event loop to handle reward computation in parallel, significantly improving processing efficiency.
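To make the concurrency model concrete, below is a simplified sketch (not the PR's actual worker code) of how per-sample run_single coroutines could be driven on one event loop with asyncio.gather; the plain-dict sample format and the class name are hypothetical stand-ins for the real DataProto-based interface.

```python
import asyncio


class RewardWorkerSketch:
    """Hypothetical, simplified stand-in for a sample-level reward worker."""

    def __init__(self, compute_score, reward_router_address, reward_model_tokenizer):
        self.compute_score = compute_score
        self.reward_router_address = reward_router_address
        self.reward_model_tokenizer = reward_model_tokenizer

    async def run_single(self, sample: dict) -> dict:
        # One sample -> one coroutine; awaiting the remote reward call yields the
        # event loop so other samples keep making progress in the meantime.
        score = await self.compute_score(
            data_source=sample["data_source"],
            solution_str=sample["solution_str"],
            ground_truth=sample["ground_truth"],
            extra_info=sample.get("extra_info", {}),
            reward_router_address=self.reward_router_address,
            reward_model_tokenizer=self.reward_model_tokenizer,
        )
        return {"reward_score": score}

    async def run_batch(self, samples: list[dict]) -> list[dict]:
        # Score every sample concurrently on the same event loop.
        return await asyncio.gather(*(self.run_single(s) for s in samples))
```

With this structure, a batch of samples costs roughly the latency of the slowest reward call rather than the sum of all calls.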

Moreover, this implementation can be integrated with agentloop for improved efficiency (already implemented in this PR):

image

In this mode, the reward model operates independently from the rollout process (standalone mode), enabling a natural async data flow where each sample undergoes reward rollout immediately after actor rollout.

With this implementation, code redundancy is reduced in the existing reward model while maximizing flexibility for user-customized reward functions.

Runnable Scripts

A runnable example is provided in recipe/fapo/. The newly introduced parameters for this implementation are placed in fapo/config and will be integrated into the main codebase upon completion of the refactoring.

wuxibin89 (Collaborator) commented on Oct 17, 2025

Does sglang wake_up/sleep work in colocated mode? I observed that sglang seems to resume with random weights in colocated mode.
https://github.com/volcengine/verl/blob/main/verl/workers/rollout/sglang_rollout/async_sglang_server.py#L183-L188

yyDing1 (Collaborator, Author) commented on Oct 17, 2025

Does sglang wake_up/sleep work in colocated mode? I observed that sglang seems to resume with random weights in colocated mode. https://github.com/volcengine/verl/blob/main/verl/workers/rollout/sglang_rollout/async_sglang_server.py#L183-L188

Yes, I checked the related issues and found that the same phenomenon was mentioned in sgl-project/sglang#6367 (comment). In short, since RL normally uploads a new set of parameters, sglang simply discards the old ones to speed up. I also looked into the recent PR sgl-project/sglang#10873, which seems to add support for reusing the original weights by keeping a stored copy.

yyDing1 (Collaborator, Author) commented on Oct 17, 2025

That PR appears to be included in sglang 0.5.3, so we can pass enable_weights_cpu_backup=True when launching the sglang servers (for the reward models) to resolve this issue.
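For reference, here is a hedged sketch of launching a reward server with that option; the CLI flag name is an assumption derived from the parameter above (following sglang's usual ServerArgs-to-flag naming) and should be verified against sglang >= 0.5.3 before use.

```python
import subprocess

# Assumed flag name (--enable-weights-cpu-backup) mapped from the
# enable_weights_cpu_backup server argument; the model path and port are placeholders.
reward_server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", "/path/to/generative_reward_model",
    "--port", "30000",
    "--enable-weights-cpu-backup",
])
```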

wuxibin89 merged commit d55929f into volcengine:main on Oct 21, 2025 (82 of 85 checks passed).
yyDing1 deleted the fapo branch on October 23, 2025 at 09:04.
wangboxiong320 pushed a commit to wangboxiong320/verl that referenced this pull request Nov 1, 2025
NenoL2001 pushed a commit to NenoL2001/verl that referenced this pull request Nov 3, 2025
AlexJJ009 pushed a commit to AlexJJ009/verl that referenced this pull request Nov 5, 2025